Incremental Algorithm for Discovering Frequent Subsequences in Multiple Data Streams
Authors
Abstract
In recent years, new applications have emerged that produce data streams, such as stock data and sensor networks. Therefore, finding frequent subsequences, or clusters of subsequences, in data streams is an essential task in data mining. Data streams are continuous in nature, unbounded in size, and have a high arrival rate. Due to these characteristics, traditional clustering algorithms fail to find clusters in data streams effectively. Thus, an efficient incremental algorithm is proposed to find frequent subsequences in multiple data streams. The described approach finds frequent subsequences by clustering subsequences of a data stream. The proposed algorithm uses a window model to buffer the continuous data streams. Further, it does not recompute the clustering results for the whole data stream at every window; rather, it builds on the clustering results of previous windows. The proposed approach also employs a decay value for each discovered cluster to determine when to remove old clusters and retain recent ones. In addition, the proposed algorithm is efficient because it scans the data streams only once, and it is considered an any-time algorithm because the frequent subsequences are ready at the end of every window.

DOI: 10.4018/jdwm.2011100101. International Journal of Data Warehousing and Mining, 7(4), 1-20, October-December 2011. Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

& Taniar, 2005; Goh & Taniar, 2004). Unlike traditional static databases, data streams are continuous, unbounded in size, and usually have a high arrival rate. The nature of data streams imposes requirements on any algorithm designed to mine them, for instance to find frequent subsequences. For example, since data streams are unbounded in size and have a high arrival rate, algorithms are allowed only one look at the data.
This means that algorithms for data streams may not have the chance to revisit the data. To solve this problem, a buffer is used to collect the data temporarily for processing. A sliding window model (Zhu & Shasha, 2002) can be used to buffer n values of a data stream. The algorithm should also be incremental: it does not recompute the results after every window, but only updates and builds on the results computed for previous windows.

In this paper, we investigate finding frequent subsequences in multiple data streams. The proposed algorithm finds frequent subsequences by clustering subsequences of a data stream. A subsequence is considered frequent if the number of similar subsequences in its cluster is above a threshold value called support. Because of the challenging characteristics of data streams (continuous, unbounded in size, and usually with a high arrival rate), the proposed algorithm is an incremental, efficient, any-time algorithm. That is, at the end of every window, the proposed algorithm does not recompute the clustering results of similar subsequences; instead, it updates the previous clustering results. It also employs a decay value for each discovered cluster to determine when to remove old clusters and retain recent ones. In addition, the proposed algorithm is efficient because it scans the data streams only once, and it is considered an any-time algorithm because the frequent subsequences are ready at the end of every window.

Finding frequent subsequences, or clusters of subsequences, can be used in many applications.
For example, network monitoring can discover common usage patterns; exploring common trends among stocks in financial markets can lead to good predictions of their future behavior; discovering web click patterns would help website administrators buffer and pre-fetch busy web pages more efficiently and place advertisements; and finding the load pattern on busy servers would assist system administrators in putting a more efficient load-balancing scheme in place. Applications like these, and the lack of efficient and incremental algorithms for finding frequent subsequences, motivated this work. Although there are many works on mining frequent itemsets over transactional data streams, little has been done on mining frequent subsequences over streaming real-valued data. Also, most of these works deal with a single data stream, while the proposed algorithm deals with multiple data streams.

The main contributions of this paper are:
• The proposed algorithm is incremental because the clustering results of the current window are built on the results of previous windows, and because it employs a decay value to remove old frequent subsequences and retain the most recent ones.
• The proposed algorithm is an any-time algorithm since the clustering results of frequent subsequences are readily available at the end of every window.
• The proposed algorithm is an exact algorithm since no approximation of the data is used.
• The proposed algorithm is designed to be executed in parallel for multiple data streams.

The rest of this paper is organized as follows. Section 2 discusses the related work. In Section 3, we present some background information, formally define the problem, and propose a solution. The proposed algorithm is presented in Section 4. In Section 5, we discuss the results of our experiments and show the feasibility of our approach. Finally, we conclude the paper in Section 6.
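The ideas introduced above (buffering a window, updating earlier clusters instead of recomputing them, a support threshold, and a per-cluster decay value) can be illustrated with a minimal Python sketch. It is not the paper's algorithm, whose details appear in Sections 3 and 4 and are not reproduced here. Everything specific is an assumption made for illustration: the names `Cluster`, `StreamClusterer`, and `mine_streams`, the use of non-overlapping windows and non-overlapping subsequences, Euclidean distance with a fixed `radius`, the first member of a cluster serving as its representative, and the rule that an untouched cluster loses one unit of decay per window.

```python
import math
from dataclasses import dataclass


@dataclass
class Cluster:
    centroid: tuple   # representative subsequence (assumption: first member seen)
    count: int        # number of similar subsequences absorbed so far
    decay: int        # windows left before an untouched cluster is removed


class StreamClusterer:
    """Hypothetical one-pass clusterer for a single stream (illustrative only)."""

    def __init__(self, window_size, subseq_len, radius, support, max_decay):
        self.window_size = window_size
        self.subseq_len = subseq_len
        self.radius = radius
        self.support = support
        self.max_decay = max_decay
        self.buffer = []     # values of the current window
        self.clusters = []   # results carried over from previous windows

    def push(self, value):
        """Buffer one value; update the clusters once the window is full."""
        self.buffer.append(value)
        if len(self.buffer) == self.window_size:
            self._process_window()
            self.buffer = []  # each value is looked at only once

    def _nearest(self, subseq):
        """Index of the closest cluster within `radius`, or None."""
        best, best_dist = None, self.radius
        for i, cluster in enumerate(self.clusters):
            dist = math.dist(subseq, cluster.centroid)
            if dist <= best_dist:
                best, best_dist = i, dist
        return best

    def _process_window(self):
        touched = set()
        last_start = self.window_size - self.subseq_len
        for start in range(0, last_start + 1, self.subseq_len):
            subseq = tuple(self.buffer[start:start + self.subseq_len])
            idx = self._nearest(subseq)
            if idx is None:
                # No similar cluster yet: start a new one.
                self.clusters.append(Cluster(subseq, 1, self.max_decay))
                idx = len(self.clusters) - 1
            else:
                # Incremental update: build on the earlier result.
                self.clusters[idx].count += 1
            touched.add(idx)
        survivors = []
        for i, cluster in enumerate(self.clusters):
            # Refresh clusters seen in this window, age the others.
            cluster.decay = self.max_decay if i in touched else cluster.decay - 1
            if cluster.decay > 0:
                survivors.append(cluster)
        self.clusters = survivors

    def frequent(self):
        """Any-time answer: representatives of clusters meeting the support."""
        return [c.centroid for c in self.clusters if c.count >= self.support]


def mine_streams(streams, **params):
    """Run one independent clusterer per stream (sequentially in this sketch)."""
    results = {}
    for name, values in streams.items():
        clusterer = StreamClusterer(**params)
        for value in values:
            clusterer.push(value)
        results[name] = clusterer.frequent()
    return results
```

Because each `StreamClusterer` keeps its own buffer and clusters, the per-stream loop in `mine_streams` has no shared state, so the streams could in principle be handled by separate workers. Calling `frequent()` between windows returns the result as of the last completed window.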
Similar resources
Detection of variable length anomalous subsequences in data streams
We consider the problem of anomaly detection in data streams, which is the problem of extracting subsequences that do not match an expected behaviour. The main challenge for detecting anomalous subsequences from data streams in the existing techniques is to determine the lengths of the normal and anomalous subsequences. Therefore, creating a robust model for detecting the anomalous subsequences...
Efficient Identification of Common Subsequences from Big Data Streams Using Sliding Window Technique
We propose an efficient Frequent Sequence Stream algorithm for identifying the top k most frequent subsequences over big data streams. Our Sequence Stream algorithm gains its efficiency by its time complexity of linear time and very limited space complexity. With a pre-specified subsequence window size S and the k value, in very high probabilities, the Sequence Stream algorithm retrieve the top...
Mining Frequent Patterns in Uncertain and Relational Data Streams using the Landmark Windows
Today, in many modern applications, we search for frequent and repeating patterns in the analyzed data sets. In this search, we look for patterns that frequently appear in the data set and mark them as frequent patterns to enable users to make decisions based on these discoveries. Most algorithms presented in the context of data stream mining and frequent pattern detection work either on uncertai...
Discovering Frequent Tree Patterns over Data Streams
Since tree-structured data such as XML files are widely used for data representation and exchange on the Internet, discovering frequent tree patterns over tree-structured data streams becomes an interesting issue. In this paper, we propose an online algorithm to continuously discover the current set of frequent tree patterns from the data stream. A novel and efficient technique is introduced to...
An Efficient Incremental Algorithm to Mine Closed Frequent Itemsets over Data Streams
The purpose of this work is to mine closed frequent itemsets from transactional data streams using a sliding window model. An efficient algorithm IMCFI is proposed for Incremental Mining of Closed Frequent Itemsets from a transactional data stream. The proposed algorithm IMCFI uses a data structure called INdexed Tree(INT) similar to NewCET used in NewMoment[5]. INT contains an index table Item...
Journal title:
- IJDWM
Volume 7, Issue
Pages -
Publication date 2011